Comparison of Binary Classification Based on Signed 
Distance Functions with Support Vector Machines 
1 Introduction

Efficient and accurate computational solu- 
tions for binary classification problems are 
currently of interest in many contexts, par- 
ticularly in biomedical informatics and com- 
putational biology where the interesting ge- 
nomic and proteomic data sets are imbued 
with dimensional complexity and confounded 
by noise. Over the past several years it 
has been effectively demonstrated that binary 
classification of genomic and proteomic data 
can be used to connect a molecular snap- 
shot of an individual's internal state with 
the presence or absence of a disease. This 
potential promises to revolutionize person- 
alized medicine and is fueling the develop- 
ment and analysis of robust classification al- 
gorithms. Among the existing classification 
algorithms Support Vector Machine (SVM) 
methods have distinguished themselves as ef- 
ficient, accurate and robust. Applications of 
Radial Basis Function Networks (RBFN) to 
classification have also generated attention. 

We consider only a geometric (rather than
statistical) formulation of the binary
classification problem. Namely, we suppose
that the space of measurements X is divided
into two subsets, A and its complement
$A^c = X \setminus A$. We are given data
$\{x_i\}$ for which we know the membership
in A or $A^c$ of each data point.
From this data the binary classification prob- 
lem is to construct a rule or classifier that we 
can use to predict the class of new, unchar- 
acterized data. As an example, the data may 
be measurements of genomic activation lev- 
els, one class might be measures from individ- 
uals known to have a certain type of disease 
while the other class may be from individuals 
without the disease. 

The linear SVM method was originally de- 
signed to be geometric and robust through 
a constraint that it produce a dividing sur- 
face of maximal margin between data of op- 
posite type. However, it has been shown that 
nonlinear SVM implementations actually are 
built around reconstruction of the indicator 
function that ties the location of the data to 
its class ([2]). The indicator function: 



$$i_A(x) = \begin{cases} 1 & \text{if } x \in A \\ -1 & \text{if } x \in A^c \end{cases}$$



encodes only the most primitive geometric
information. In [2] we proposed an alternative
tool for classification, the Signed Distance
Function (SDF), that measures the signed
distance from the data to the boundary
between the classes, i.e.

$$b_A(x) = i_A(x) \min_{z \in \partial A} d(x, z) \qquad (1)$$

where $i_A$ is the indicator function,
$\partial A$ is the boundary between the
classes, and d is a distance function. While the
SDF has not previously been applied to
classification, it has been an important tool
in other fields, such as free boundary
problems in fluid dynamics, and so has a rich
mathematical development that could be
exploited. We have tested rudimentary
classification algorithms based on the idea of
reconstructing the SDF from training data. We
note that this reconstruction could be based
on any accepted method of regression,
including SVM or RBFN regression. Thus, new
SVM or RBFN classification methods could be
built on the SDF foundation. One simple, yet
appealing, choice for the nonlinear regression
is the least squares regression discussed in
[7]. We have implemented this approach in the
algorithm for nonlinear data described below.

We investigate the performance of a simple
SDF-based method by direct comparison with
standard SVM packages, as well as K-nearest
neighbor and RBFN methods. We present
experimental results comparing the SDF
approach with other classifiers on both
synthetic geometric problems and five
benchmark clinical microarray data sets. On
both geometric problems and microarray data
sets, the non-optimized SDF-based classifiers
perform as well as or slightly better than
well-developed, standard SVM methods. These
results demonstrate the potential accuracy of
SDF-based methods on some types of problems.

Algorithms. A procedure for training an SDF
classifier is given in Algorithm 1. It takes
as arguments a training data set
$\{(x_i, y_i)\}_{i=1}^{N}$ and a smoothing
parameter $\gamma$, and returns a trained SDF
classifier

$$B(x) = \operatorname{sign}[b_A(x)] = \operatorname{sign}\Big[\sum_{i=1}^{N} \alpha_i K(x, x_i)\Big].$$

In the SDF paradigm the input training data
are marked as to class, but they do not come
marked with the values $b_A(x_i)$, and hence
these need to be approximated. A reasonable
and simple first approximation of $b_A$ at
$\{x_i\}_{i=1}^{N}$ is given by

$$b_i = i_A(x_i) \cdot \min\{ d(x_i, x_j) : i_A(x_j) \neq i_A(x_i) \}, \qquad (2)$$

i.e. the signed projection onto the data of
opposite type.
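Equation (2) can be illustrated with a few lines of plain Python. This is only a sketch under stated assumptions: the function name `signed_distance_estimates` is hypothetical, the class marks are taken to be +1 and -1, and the distance d is assumed to be Euclidean, which the text does not specify.

```python
import math

def signed_distance_estimates(points, labels):
    """Approximate b_i = i_A(x_i) * min{ d(x_i, x_j) : i_A(x_j) != i_A(x_i) }.

    points: list of coordinate tuples; labels: list of +1 / -1 class marks.
    Euclidean distance is an assumption made for this illustration.
    """
    estimates = []
    for xi, yi in zip(points, labels):
        # Distances from x_i to every training point of the opposite class.
        opposite = [math.dist(xi, xj)
                    for xj, yj in zip(points, labels) if yj != yi]
        if not opposite:
            raise ValueError("both classes must be present in the training data")
        # Sign by the class of x_i, magnitude by the nearest opposite point.
        estimates.append(yi * min(opposite))
    return estimates

# Toy data: three points of class +1 and one point of class -1.
points = [(0.0, 1.0), (0.1, 1.0), (-0.1, 1.0), (0.0, -1.0)]
labels = [1, 1, 1, -1]
b = signed_distance_estimates(points, labels)
```

Each estimate is an upper bound on the true distance to the boundary, since the nearest point of opposite type lies on the far side of that boundary.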



Algorithm 1

1. For each feature $1 \le k \le m$, where m
denotes the number of features, compute the
Pearson correlation coefficient $r_k$ between
$(y_1, y_2, \dots, y_N)^T$ and
$(x_{1k}, x_{2k}, \dots, x_{Nk})^T$.

2. Calculate the weighted distance matrix D,
with
$D_{ij} = \sqrt{\sum_{k=1}^{m} [r_k (x_{ik} - x_{jk})]^2}$
for any $1 \le i, j \le N$.

3. Estimate the width of the Gaussian kernel
function, $\sigma$, by the Root Mean Squared
Distance (RMSD)
$\sigma = \sqrt{\frac{2}{N(N+1)} \sum_{i=1}^{N} \sum_{j=i+1}^{N} D_{ij}^2}$.

4. Calculate the Gaussian kernel matrix K,
with
$K_{ij} = K(x_i, x_j) = \exp(-D_{ij}^2 / 2\sigma^2)$
for any $1 \le i, j \le N$.

5. Estimate $b = \{b_i\}_{i=1}^{N}$ at the
training data $\{x_i\}_{i=1}^{N}$ using (2).

6. Reconstruct $b_A$ on the entire domain
through regression, i.e. by solving the linear
system of equations

$$(K + N\gamma I)\alpha = b \qquad (3)$$

where I is the identity matrix.
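The steps of Algorithm 1 can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: the names `train_sdf`, `pearson` and `solve_linear` are hypothetical, the distance d in equation (2) is assumed to be the weighted distance of Step 2, the RMSD normalization 2/(N(N+1)) is one reading of Step 3, and the toy data are invented.

```python
import math

def solve_linear(A, rhs):
    """Solve A z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [list(row) + [r] for row, r in zip(A, rhs)]  # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][k] * z[k] for k in range(r + 1, n))) / M[r][r]
    return z

def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv) if su > 0 and sv > 0 else 0.0

def train_sdf(X, y, gamma):
    """Return a classifier x -> +1 / -1 trained along the lines of Algorithm 1."""
    n, m = len(X), len(X[0])
    # Step 1: one Pearson weight per feature.
    r = [pearson([row[k] for row in X], y) for k in range(m)]

    def dist(a, b):
        # Step 2: feature-weighted Euclidean distance.
        return math.sqrt(sum((rk * (ak - bk)) ** 2 for rk, ak, bk in zip(r, a, b)))

    D = [[dist(X[i], X[j]) for j in range(n)] for i in range(n)]
    # Step 3: kernel width from the squared distances over pairs i < j
    # (the 2 / (N (N + 1)) factor is an assumption about the garbled source).
    pair_sum = sum(D[i][j] ** 2 for i in range(n) for j in range(i + 1, n))
    sigma = math.sqrt(2.0 * pair_sum / (n * (n + 1)))
    # Step 4: Gaussian kernel matrix.
    K = [[math.exp(-D[i][j] ** 2 / (2 * sigma ** 2)) for j in range(n)]
         for i in range(n)]
    # Step 5: signed distance to the nearest training point of opposite type.
    b = [y[i] * min(D[i][j] for j in range(n) if y[j] != y[i]) for i in range(n)]
    # Step 6: solve (K + N * gamma * I) alpha = b.
    A = [[K[i][j] + (n * gamma if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    alpha = solve_linear(A, b)

    def classify(x):
        value = sum(a * math.exp(-dist(x, xi) ** 2 / (2 * sigma ** 2))
                    for a, xi in zip(alpha, X))
        return 1 if value >= 0 else -1

    return classify

# Invented toy data: two well-separated clusters in the plane.
X = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (2.0, 2.0), (2.2, 1.9), (1.9, 2.3)]
y = [-1, -1, -1, 1, 1, 1]
classify = train_sdf(X, y, gamma=1e-3)
```

For realistic sample sizes a library linear solver would replace `solve_linear`; the hand-written elimination is used here only to keep the sketch free of dependencies.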



In the preliminary explorations on microarray
data, we used the Pearson correlation
coefficients to rank the relative importance
of the features (Step 1), and used these to
compute the weighted distances between cases
(Step 2). The parameter $\sigma$ determines
the width of the Gaussian functions centered
at the data points. In Algorithm 1, $\sigma$
was determined based on mean data distances
(Step 3). The Gaussian kernel matrix was
calculated from the distance matrix in the
same way as in conventional SVM algorithms
(Step 4). The SDF was estimated using the
simple heuristic approximation of equation (2)
(Step 5). The problem of solving (3) is
well-posed since $(K + N\gamma I)$ is strictly
positive definite, and the condition number
will be good provided that $N\gamma$ is not
too small (Step 6).
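The well-posedness remark can be made concrete with a short worked bound. It assumes, as holds for a Gaussian kernel matrix but is not spelled out in the text, that K is symmetric positive semidefinite with unit diagonal, so that its largest eigenvalue is at most its trace N:

```latex
\begin{aligned}
\lambda_{\min}(K + N\gamma I) &= \lambda_{\min}(K) + N\gamma \;\ge\; N\gamma, \\
\lambda_{\max}(K + N\gamma I) &= \lambda_{\max}(K) + N\gamma \;\le\; \operatorname{tr}(K) + N\gamma = N(1 + \gamma), \\
\kappa_2(K + N\gamma I) &= \frac{\lambda_{\max}(K) + N\gamma}{\lambda_{\min}(K) + N\gamma} \;\le\; 1 + \frac{1}{\gamma}.
\end{aligned}
```

Under these assumptions, a smoothing parameter of, say, $\gamma = 10^{-3}$ keeps the condition number below 1001 however close to singular K itself may be.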

The procedure for training the SDF classifier
on linearly separable data is simpler and can
be obtained by removing Steps 2 and 3 from
Algorithm 1, and replacing the Gaussian kernel
function $K(x_i, x_j)$ by the inner product
operator $\langle x_i, x_j \rangle = x_i^T x_j$.



2 Experimental Results 

In [2] we investigated these methods in
linearly separable problems, the 4 by 4
checkerboard problem and two cancer diagnosis
problems involving microarray data. In this
manuscript we report tests of primitive
SDF-based methods on two new geometric
problems and four additional cancer diagnosis
problems. We compare our algorithm with the
standard SVM methods, as well as the
Lagrangian SVM |E] and Proximal SVM [3]. A
software implementation of our algorithm with
a GUI can be found at:
http://people.vanderbilt.edu/~minhui.xie/sdf2005/

Checkerboard Problems. In [2] we presented
numerical results on the 4 by 4 checkerboard
problem, a geometric problem that is known to
be hard. In those tests SDF-based
classification outperformed the best reported
SVM results as well as standard SVM packages.

Here we perform a simple experiment that
sheds light on the performance advantage of
the SDF methods in this geometrical context.
In this test we divide the square
$X := [-1, 1] \times [-1, 1]$ into two sets,
$A := [0, 1] \times [0, 1]$ and its
complement. We choose a data set uniformly at
random in X and solve equation (3), setting
the right hand side to: (a) the exact
distances to the boundary, b, (b) the exact
values of the indicator function, i. We then
use the coefficients determined in each case
to compute the value of the reconstructed
function at the corner, $|b(0,0)|$, whose
value is a direct measure of the error in
fitting the boundary between the classes at
the corner. We considered 100 independent
trials and calculated the mean value of the
absolute error and its variance (Table 1).
The SDF average local error was an order of
magnitude lower, as was its standard
deviation, and as the number of input data
increases it decreases rapidly toward zero.
This demonstrates the main advantage of the
SDF over the indicator function: the SDF is
much more suitable for the regression step.

Data Set       SDF Error          IF Error
100 points     0.0365 ± .0268     0.5180 ± .2530
500 points     0.0135 ± .0104     0.4916 ± .1066
1000 points    0.0098 ± .0084     0.4775 ± .0996

Table 1: A comparison of the local error and
reliability of the SDF and indicator function
(IF) regression applied to a local
checkerboard problem.

Biased distribution of data. We observed in
the linear tests that the SDF classification
particularly outperformed the SVM methods on
skewed data sets. We give here an explanation
that illustrates a clear advantage of using
the SDF over the indicator function. Suppose
that the data has more samples of one type
than the other. If the indicator function is
approximated, then the additional data of one
type will reinforce that value of the
indicator function, effectively enlarging the
region predicted in that set. If the signed
distance function is approximated, the
distance to the boundary is reinforced, which
does not move the boundary.

We illustrate this with a simple example.
Consider the data set
{(0, 1), (.1, 1), (-.1, 1), (0, -1)}. The
original SVM would place the separating line
for this data at y = 0. However, the Proximal
SVM places it at y = -.2 and the Lagrangian
SVM places it at y = -.125. If more points
are added near (0, 1), the separating line is
pushed further into y < 0. However, the SDF
linear classifier places the separating line
at y = 0 up to an error comparable to
machine epsilon.
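The claim about the four-point data set can be checked numerically for one reading of the linear SDF classifier. The sketch below assumes the linear variant is Algorithm 1 with the kernel replaced by the inner product and no intercept term, that d in equation (2) is Euclidean, and that $\gamma = 10^{-3}$; the names `train_linear_sdf` and `solve_linear` are hypothetical.

```python
import math

def solve_linear(A, rhs):
    """Solve A z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [list(row) + [r] for row, r in zip(A, rhs)]  # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][k] * z[k] for k in range(r + 1, n))) / M[r][r]
    return z

def train_linear_sdf(X, y, gamma):
    """Return the weight vector w of the fitted function b_A(x) = <w, x>."""
    n = len(X)

    def dot(a, b):
        return sum(p * q for p, q in zip(a, b))

    # Equation (2): signed distance to the nearest point of opposite type.
    b = [yi * min(math.dist(xi, xj) for xj, yj in zip(X, y) if yj != yi)
         for xi, yi in zip(X, y)]
    # Equation (3) with K_ij = <x_i, x_j>.
    A = [[dot(X[i], X[j]) + (n * gamma if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    alpha = solve_linear(A, b)
    # sum_i alpha_i <x, x_i> = <w, x> with w = sum_i alpha_i x_i.
    return [sum(a * xi[k] for a, xi in zip(alpha, X)) for k in range(len(X[0]))]

X = [(0.0, 1.0), (0.1, 1.0), (-0.1, 1.0), (0.0, -1.0)]
y = [1, 1, 1, -1]
w = train_linear_sdf(X, y, gamma=1e-3)
# The separating line is {x : w[0] * x[0] + w[1] * x[1] = 0}.
```

With these assumptions the first component of `w` vanishes up to rounding and the second is positive, so the separating line is y = 0. Because this sketch has no intercept, the line passes through the origin by construction; the check therefore confirms only that the three points near (0, 1) do not tilt it.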

Microarray Data Sets. In [2] we tested an
SDF-based classifier on two standard genomic
data sets involving cancer diagnosis from
microarray experiments and found the
SDF-based classification to do as well as or
better than standard LIBSVM routines. In the
current paper we further compare the
generalization performance of the SDF-based
classifier with that of three other types of
distance-based classifiers: KNN, Radial Basis
Function Networks (RBFN) and SVM. We use the
following microarray data sets:



The Breast Cancer data set [11] consists
of 49 tumor samples with 7129 human
genes each. There are two different
response variables in the data set: one
describes the status of the estrogen
receptor (ER), and the other describes
the lymph node (LN) status. Of the
49 samples, 25 are ER+ and 24 are ER-,
25 are LN+ and 24 are LN-.

The Colon Cancer data set [1] consists
of 40 tumor and 22 normal colon tissues
with 2000 genes each.

The Leukemia data set [4] consists of 72 
samples with 7129 genes each. Each pa- 
tient represented by a sample has either 
acute lymphoblastic leukemia (ALL) or 
acute myeloid leukemia (AML). Of the 
72 samples, 47 are ALL and 25 are AML. 



We tested the four classifiers in 100 inde- 
pendent trials on each of the data sets. In 
each trial, the data set is divided randomly 
into a training set and a test set according 
to the ratio of 2:1. We use Gaussian ker- 
nel functions for RBFN, SVM and SDF. We 
claim that the classifiers are comparable in 
this setting since they are used under exactly 
the same conditions: (i) They share the same 
training set and test set in each trial, (ii) 
SVM and SDF share the same 7 = 10~^, 
(iii) SVM and SDF used the same weighted 
kernel matrix in each trial, (iv) SVM and 
RBFN used the same a, which is computed 
in each trial by the RMSD as described in 
Algorithm 1 in Section 3.4. 
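The evaluation protocol described above (repeated random 2:1 splits with the test error averaged over trials) can be sketched as follows. The function names `mean_test_error` and `train_1nn`, the fixed random seeds, and the synthetic two-cluster data are all assumptions made for illustration; a nearest-neighbor rule stands in for the four classifiers actually compared.

```python
import math
import random

def mean_test_error(X, y, train, trials=100, seed=0):
    """Average misclassification rate over random 2:1 train/test splits.

    train(X_train, y_train) must return a function mapping a point to a label.
    """
    rng = random.Random(seed)
    indices = list(range(len(X)))
    n_train = (2 * len(X)) // 3
    rates = []
    for _ in range(trials):
        rng.shuffle(indices)
        tr, te = indices[:n_train], indices[n_train:]
        classify = train([X[i] for i in tr], [y[i] for i in tr])
        errors = sum(classify(X[i]) != y[i] for i in te)
        rates.append(errors / len(te))
    return sum(rates) / len(rates)

def train_1nn(X_train, y_train):
    """Stand-in classifier: the label of the nearest training point."""
    def classify(x):
        j = min(range(len(X_train)), key=lambda i: math.dist(x, X_train[i]))
        return y_train[j]
    return classify

# Synthetic data: two tight clusters, 15 points each.
data_rng = random.Random(1)
X = ([(data_rng.gauss(0, 0.1), data_rng.gauss(0, 0.1)) for _ in range(15)]
     + [(data_rng.gauss(3, 0.1), data_rng.gauss(3, 0.1)) for _ in range(15)])
y = [-1] * 15 + [1] * 15
rate = mean_test_error(X, y, train_1nn)
```

To compare several classifiers under the same conditions, as in condition (i) above, each classifier would be called with the same `seed` so that every trial uses identical training and test sets.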

Figures 1, 2, and 3 show the boxplots of the
test error rates in 100 trials, on the breast
cancer data set, the colon cancer data set,
and the leukemia data set, respectively. In
Figure 1(a) the response variable is ER and in
Figure 1(b) the response variable is LN. Since
the sample size of each data set is less than
80, KNN with a large number of neighbors
might not achieve good performance due to
overfitting. Hence in our experiments we test
KNN only with k from 1 to 10.



Data Set       KNN      RBFN     SVM      SDF
Breast (ER)    .0912    .0912    .0869    .0869
Breast (LN)    .2400    .2425    .2106    .2100
Colon          .2200    .2143    .1700    .1662
Leukemia       .0146    .0321    .0167    .0167

Table 2: Comparison of misclassification
rates averaged over 100 trials on randomly
divided data.

Table 2 shows the test error rates averaged
over the 100 independent trials for each
classifier. KNN with 1, 9, 3, 5, 5 neighbors
achieves the best (in the averaging sense)
generalization performance for the breast
cancer data (ER), breast cancer data (LN),
colon cancer data, colon cancer data with 5
samples removed, and the leukemia data,
respectively. In the table we only keep the
averaged test error rates for the best KNN.

Figure 1: Comparison of misclassification
rates for breast cancer data. (Panel title:
Comparison of % Misclassification for Breast
Cancer Data (LN).)

Figure 2: Comparison of misclassification
rates for colon data. (Panel title:
Comparison of % Misclassification for Colon
Data.)

Figure 3: Comparison of misclassification
rates for leukemia data.

From the boxplots and the averaged test 
error rates, we can see that the performances 
of KNN and RBFN on the microarray data 
sets are generally not as good as those of SVM 
and SDF. The performance of the SDF
classifier matches that of the SVM method on
the breast cancer data (ER) and the leukemia
data, and exceeds it on the breast cancer
data (LN) and the colon cancer data.

3 Conclusions 

From the experimental results shown above, 
SDF-based classifiers promise to be more ac- 
curate than current generation classification 
methods and just as efficient computation- 
ally. Because it is geometrical, SDF-based 
nonlinear classification is theoretically a more 
faithful and natural generalization of the orig- 
inal SVM concept than existing nonlinear 
SVM implementations. The algorithm presented
in the paper and used in the experiments is
the most naive version, so there are many
other directions of future development that
need to be followed. These might include
investigation of different methods for
initialization of $b$ from the training data,
optimization of $b$ through various iteration
schemes, exploration of the use of different
regression techniques, such as SVM and RBFN
regression, to reconstruct $b_A$ from $b$,
and so on. Due to the importance of
biological data sets, which usually have
thousands of features, we would also try
various dimension reduction techniques, such
as Principal Component Analysis (PCA),
Independent Component Analysis (ICA), and
Factor Analysis, to preprocess the data and
analyze the interactions between
dimensionality and performance of the
SDF-based classifiers. Finally, as pointed
out in [2], the use of the SDF in other
applications and its relationship to deep
mathematical results might be exploited to
both improve implementations and provide
rigorous validation of the process.